1 Introduction

PathAnalyser is an intuitive and user-friendly package that allows users to assess pathway activity of transcriptomic data sets using a pathway gene signature. PathAnalyser uses GSVA to estimate expression abundance of the up-regulated and down-regulated gene sets of the gene signature for each sample in a transcriptomic data set, and then ranks samples based on expression abundance of the up-regulated and down-regulated gene sets of the gene signature separately. The classification algorithm uses user-configurable thresholds to tailor the pathway-based classification to their own classification needs. Unlike other available pathway assessment packages which do not distinguish between up-regulated and down-regulated gene sets, the PathAnalyser classification algorithm considers both parts of a gene signature: the up-regulated and down-regulated genes separately in assessing consistency of expression with the gene signature.

Other features unique to PathAnalyser package include:

  1. Multiple samples are considered for the classification analysis, building upon the sample-wise analysis provided by the GSVA package

  2. Unlike other classification algorithms, PathAnalyser does not directly force classification of samples into active or inactive pathway classes

  3. Detailed evaluation of the pathway-based sample classification including descriptive statistics such as accuracy, percentage classified, recall and ROC curve visualization

  1. Classification visualization - PCA plots can be generated for visualization of the predicted pathway classes (active/inactive) for each sample.

2 General Workflow

A flow chart of the general workflow of the analysis conducted using PathAnalyser is illustrated below:

PathAnalyser workflow

Figure 1: PathAnalyser workflow

Note: The classification evaluation step is optional and can only be performed if actual pathway activity class labels (e.g. active, inactive or uncertain) for the corresponding pathway are present.

2.1 Overview of PathAnalyser functionality

The PathAnalyser package functionality can be sub-divided into 5 broad categories corresponding to a typical workflow for assessing pathway activity of samples:

  1. Input functions for expression data and gene signatures
  2. Quality control and pre-processing of input data
  3. Classification of samples by pathway activity
  4. Evaluation of classification
  5. Visualization

3 Installation and setup

3.1 Installing dependencies

To install all required dependencies of PathAnalyser, type the following in R:

# If not already installed
install.packages("BiocManager")

BiocManager::install(c("GSVA", "pROC", "edgeR", "reshape2", "ggplot2","limma", 
                       "VennDiagram", "NCmisc", "futile.logger"), 
                     dependencies = TRUE)

3.2 Installing package using devtools

The easiest way to install PathAnalyser is to install directly from GitHub using the devtools package in R:

install.packages("devtools")
devtools::install_github("ozlemkaradeniz/PathAnalyser")

Alternatively, you can clone the GitHub repository in a Linux / MacOS terminal:

git clone git@github.com:a-thind/PathAnalyser.git

Then type the following in R to install a local source version of the package:

library(utils)
install.packages("~/PathAnalyser", repo=NULL, type="source")

Once installation is completed, load the library to start using PathAnalyser:

library(PathAnalyser)

4 Data formats

The classification algorithm of PathAnalyser requires two types of input data:

  1. Gene expression matrix - a gene-by-sample gene expression matrix, with row names corresponding to gene names and column names corresponding to sample names / IDs.
  2. gene signature data frame - a data frame of expression values corresponding to a list of genes associated with a pathway. The first column contains gene names and the second column contains expression values: -1 for down-regulated or 1 for up-regulated genes when a specific pathway is active.

4.1 Input files

The two data types described above (gene expression matrix and gene signature dataframe) can be created from two types of input files:

  1. Gene expression matrix file - a tab-delimited text file containing a gene expression matrix, where rows represent genes and columns samples of a gene expression data set. The gene expression matrix text file can contain expression values from RNASeq or normalized microarray data. The image below is a snapshot of an example expression data set file:
knitr::include_graphics("expr_dataset_example.png")
An example gene expression matrix text file

Figure 2: An example gene expression matrix text file

The first line contains the sample names and the first column contains gene names. All other tab-separated fields represent expression values for each gene for each sample from either RNASeq or microarray experiments.

Note: PathAnalyser does not provide functionality for unnormalized microarray data, so the user must ensure the microarray data is normalized prior to performing classification by PathAnalyser.

  1. Gene signature files - this data is provided as two text files containing a list of genes provided in the gene set file format (using .grp file extension) - one for the up-regulated gene set and another containing the down-regulated gene set of the pathway signature. A plethora of pathway gene signatures in the gene set file format can be found at MSigDB. The snapshot below shows the first 30 lines of the gene set file for the up-regulated gene set of the ERBB2 (HER2) gene signature.
knitr::include_graphics("up_gene_signature.png")
An example gene set file format containing either the up- or down-regulated gene sets of the gene signature

Figure 3: An example gene set file format containing either the up- or down-regulated gene sets of the gene signature

Note: read_signature_data from PathAnalyser only accepts gene signatures that have the up-regulated and down-regulated gene sets in separate gene set format text files (with file extension .grp). The user is expected to provide one up-regulated and one down-regulated gene set file for a pathway gene signature.

5 Built-in datasets and gene signatures

5.1 Gene expression signatures

PathAnalyser contains two gene signatures for active ER and HER2 pathways that can be used to assess ER and HER2 pathway activity in a breast cancer transcriptomic data set. These signatures are accessible by typing the following in R:

data("ER_sig_df")
data("HER2_sig_df")

5.1.1 ER signature

The built-in ER signature ER_sig_df is a data frame containing the names of 160 differentially expressed genes which constitute the transcriptional signature of ER pathway activation. This signature was obtained from the sensitivity to endocrine therapy genome index defined by Symanns et al. (2011). The total number of genes that are up-regulated (denoted with an expression value of 1) and down-regulated (denoted with an expression value of -1) in this gene signature are 101 and 59 genes respectively.

dim(ER_sig_df)
#> [1] 160   2
head(ER_sig_df)
#>     gene expression
#> 1   ABAT          1
#> 2 ACADSB          1
#> 3  ADCY1          1
#> 4  ADCY9          1
#> 5   AGR2          1
#> 6  ANXA9          1
# up-regulated genes are given an expression value of 1
ER_up_sig <- ER_sig_df[ER_sig_df$expression == 1 ,]
dim(ER_up_sig)
#> [1] 101   2
# down-regulated genes are given an expression value of -1
ER_dn_sig <- ER_sig_df[ER_sig_df$expression == -1 ,]
dim(ER_dn_sig)
#> [1] 59  2
# for more details
?ER_sig_df

5.1.2 HER2 signature

A gene expression signature for HER2 (ERBB2) pathway activation HER2_sig_df is also provided by PathAnalyser. This gene signature was obtained from Molecular Signatures Database (MSigDB) by combining the up-regulated (gene set: ERBB2_UP.V1_UP ) and down-regulated (gene set: ERBB2_UP.V1_DN ) gene sets available at MSigDB, originating from transcriptomic profiling of a MCF-7 breast cancer cell line, positive for the estrogen receptor (ER) but genetically engineered to express the ERBB2 (HER2) receptor.

The names of 387 differentially expressed genes are in the HER2 signature data frame provided by PathAnalyser, of which 190 are up-regulated (denoted by an expression of 1) and 197 are down-regulated (denoted by an expression of -1).

dim(HER2_sig_df)
#> [1] 387   2
head(HER2_sig_df)
#>      gene expression
#> 1   ABCC3          1
#> 2   ABCG1          1
#> 3    ACHE          1
#> 4 AKR1B10          1
#> 5  AKR1C2          1
#> 6 ALDH4A1          1
# up-regulated genes are given an expression value of 1
HER2_up_sig <- HER2_sig_df[HER2_sig_df$expression == 1 ,]
dim(HER2_up_sig)
#> [1] 190   2
# Down-regulated genes are given an expression value of -1
HER2_dn_sig <- HER2_sig_df[HER2_sig_df$expression == -1 ,]
dim(HER2_dn_sig)
#> [1] 197   2
# for more details
?HER2_sig_df

5.2 Gene expression datasets

PathAnalyser also contains two built-in gene expression matrices each containing RNA-seq raw read counts for primary breast cancer samples obtained from 20 individuals (cases). Data for these matrices were obtained from The Cancer Genome Atlas (TCGA). Each expression matrix contains 20,124 genes.

data("ER_data_se1")
data("HER2_data_se1")
# column names represent case (sample) IDs from TCGA

5.2.1 ER dataset 1

The ER data set ER_data_se1 contains RNASeq raw read counts for 20 primary breast cancer samples, 10 of which have ER pathway activity (ER+) and 10 which have inactive ER pathway activity (ER-).

dim(ER_data_se1)
#> [1] 20124    20
# Expression data for first 6 genes
head(ER_data_se1)
#>         TCGA.AR.A0TY TCGA.AO.A124 TCGA.BH.A0BG TCGA.AN.A0G0 TCGA.AO.A0J8
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           16841         6974         8211         5586        12210
#> TMEM107          189          851          581          433         1298
#> CAMK2A            10           20           18           27           19
#> TRAM1L1          233          198          426          328         1044
#>         TCGA.D8.A1XZ TCGA.AO.A0JE TCGA.E9.A5FL TCGA.LL.A8F5 TCGA.C8.A12V
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            8627         9138         3219         7663         6867
#> TMEM107          629          253          226         1265          385
#> CAMK2A            30           47           72           45           18
#> TRAM1L1          385          467          153          342           33
#>         TCGA.A2.A0CM TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.A2.A0YF TCGA.E2.A15F
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4416         2869          718        14649         7106
#> TMEM107          332          660          230         1738          319
#> CAMK2A            29           34           25            4           48
#> TRAM1L1          212          482           28          428          507
#>         TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.BH.A18G TCGA.5L.AAT0 TCGA.A8.A07C
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            6351         3150         5278         4839         9226
#> TMEM107          633          234          245          284          565
#> CAMK2A            32           34           23           43            9
#> TRAM1L1          169           71          220          272          615

5.2.2 HER2 dataset 1

Similarly, the HER2 data set HER2_data_se1 contains RNASeq raw read counts for 20 primary breast cancer samples, 10 of which have HER2 (ERBB2) pathway activity (HER2+) and 10 which have inactive HER2 pathway activity (HER2-).

dim(HER2_data_se1)
#> [1] 20124    20
# Expression data for first 6 genes
head(HER2_data_se1)
#>         TCGA.A8.A090 TCGA.BH.A0AU TCGA.A2.A4S2 TCGA.C8.A1HM TCGA.GM.A2DK
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4404         3614         5882         6380         5080
#> TMEM107          274          171          361          428          352
#> CAMK2A            42           30           21           13           20
#> TRAM1L1          698          514          367           84          265
#>         TCGA.E9.A22D TCGA.AN.A0XV TCGA.BH.A0DG TCGA.AN.A041 TCGA.BH.A18P
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           14990         3110        10206         6425         7683
#> TMEM107         1058          742          488          595          239
#> CAMK2A             9           36           50           90           32
#> TRAM1L1          776          693          399          285          510
#>         TCGA.AC.A23C TCGA.A8.A0A9 TCGA.AR.A0U3 TCGA.A8.A081 TCGA.BH.A0HQ
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            1            0            0            0
#> TAF7           11091         2000         4663         8107         5103
#> TMEM107          418          715          630          943          774
#> CAMK2A            87           58            6           38           63
#> TRAM1L1          473          598          135          251          253
#>         TCGA.D8.A1XO TCGA.A8.A097 TCGA.C8.A1HK TCGA.AC.A23H TCGA.BH.A0BC
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           10467        14549        10020         8397        10158
#> TMEM107         1015         1228          149          367          385
#> CAMK2A            34          117           15           50           26
#> TRAM1L1          544          767          335          128          556

6 Reading input data using PathAnalyser

There are two types of input required for pathway activity assessment using PathAnalyser:

  1. gene expression data matrices
  2. gene expression signatures

6.1 Gene expression data

The read_expression_data function reads gene expression data from
a file with an delimiter, the function checks the delimiter of the file and reads the file accordingly. It returns an output with gene symbols/IDs as the row names and columns representing each sample IDs in the dataset.

data_se <- read_expression_data("/Users/taniyapal/Documents/Group Project/TCGA_unannotated.txt")
data_se<-data_se[,1:200]
print(data_se[1:5, 1:5])

6.2 Gene signature data

Gene signature files can be read by the read_signature_data function. It returns a dataframe with signature IDs/symbols in the first column and their expression pattern is represented as up r regulated or down regulated by +1 and -1 respectively.

sig_df <- read_signature_data("/Users/taniyapal/Documents/Group Project/ESR1_UP.v1._UP.csv", "/Users/taniyapal/Documents/Group Project/ESR1_DN.v1_DN.csv")
head(sig_df)

6.3 Quality control and pre-processing of data

6.3.1 Log Counts Per Million (CPM) transformation of RNASeq expression matrices

The log_cpm_transformation function transforms the raw data by the method of log CPM and returns the transformed data. It also performs sanity check of the transformation by returning box plots for visualization of distribution of data before and after transformation.

# using toy data set as expression matrix
data("ER_data_se1")
data_se <- ER_data_se1 
normalized_se <- log_cpm_transformation(data_se)

6.3.2 Checking and filtering genes from the gene expression matrix

The gene names are first checked for inclusion in the gene signature data frame. Genes that are included in the gene signature are retained then subjected to further filtration, in which their expression across the data set measured. If a gene is not present in at least 10% of the total number of samples in the data set, the gene is dropped from the expression matrix. A console message is printed reporting the number of features (genes) retained in the final normalized expression matrix.

normalized_se <- check_signature_vs_dataset(normalized_se, ER_sig_df)
#> Number of feature present in expression dataset: 146

7 Classifying samples by pathway activity using gene signatures

7.1 Overview of pathway activity classification

PathAnalyser provides assessment of pathway activity for each sample in a gene expression data set, by extending an existing unsupervised and non-parametric sample-wise gene set enrichment method, Gene set variation analysis (GSVA). GSVA measures the variation of gene set enrichment across a sample population in a manner independent of class labels, and summarizes the sample-wise expression of the gene set in the form of gene set enrichment scores for each sample (Hänzelmann et al. 2013). However, unlike GSVA, which does not distinguish up-regulated gene sets from down-regulated gene sets, PathAnalyser considers the expression of the up-regulated and down-regulated gene sets of a given pathway’s gene signature both separately and explicitly, using the GSVA method solely for estimating the relative abundance of expression of the up- and down-regulated gene sets for a given pathway in each sample.

7.2 Overall PathAnalyser classification algorithm

The estimated expression abundance for both these gene sets is used to check expression consistency with the gene signature and infer pathway activity for a given sample in the transcriptomic data set.

The classification algorithm implemented in PathAnalyser uses four thresholds for the classification algorithm (two for each gene set of the pathway signature):

  1. High expression abundance for up-regulated gene set
  2. Low expression abundance for up-regulated gene set
  3. High expression abundance for down-regulated gene set
  4. Low expression abundance for down-regulated gene set

The diagram below summarises the classification algorithm employed by PathAnalyser to classify samples by pathway activity. We include an external image with the R function:

Overiew of algorithm diagram

Figure 4: Overiew of algorithm diagram

Samples are ordered by expression abundance of up-regulated and down-regulated genes of the gene signature separately using the GSVA scores. Those samples which have the most abundant expression of the up-regulated gene set (exceeding threshold 1) show evidence of activation of the pathway but are only classified as having the pathway “Active”, if the same samples show a low abundance of expression of the down-regulated genes (below threshold 4) of the gene signature i.e. demonstrating consistency between evidence from up- and down-regulated gene sets of the gene signature. Similarly, samples with evidence of pathway inactivation from the up-regulated gene set of the gene signature i.e. the least abundant expression of the up-regulated genes of the gene signature (below threshold 2) and which also show evidence of pathway inactivation from the down-regulated gene set of the signature i.e.  most abundance expression of the down-regulated gene set of the gene signature (exceeding threshold 3), demonstrate evidence of consistency of pathway inactivation from both gene sets and therefore are classified as “Inactive”. All other samples are classified as “Uncertain” since there is insufficient evidence for pathway activation or inactivation.

In PathAnalyser, the GSVA scores thresholds for classifying a sample as “Active” (consistent with the gene signature) and “Inactive” (inconsistent with the gene signature) can be set manually and depend on the user’s pathway activity classification requirements.

PathAnalyser provides two functions for classifying samples using an absolute GSVA scores thresholds classify_GSVA_abs or percentile thresholds which is effectively rank-based classify_GSVA_percent using percentiles (default at 25%) , as well as a visualization function to view the distributions of the GSVA scores for both gene sets gsva_scores_dist, which will be discussed below.

7.3 Visualising GSVA Score Distribution

PathAnalyser provides gsva_scores_dist method to visualize the GSVA score distribution for the abundance of expression of the up-regulated and down-regulated gene sets of the gene signature for each sample. The resulting density plot usually follows a bimodal distribution of GSVA scores for each sample for the up-regulated and down-regulated gene sets, a consequence of GSVA employing the maximum deviation from zero method to calculate the GSVA score from the Kolmogorov-Smirnov like random walk statistic. Figure 5 shows two peaks for each gene set when running gsva_scores_dist with logCPM-normalized ER_data_se1 gene expression matrix. For both gene sets, the peak with the highest GSVA scores represents samples with high abundance of expression of the gene set and the lower score peak represents samples with low abundance of expression of the gene set.

gsva_scores_dist(ER_sig_df, normalized_se)
Density plots showing the distribution of GSVA scores for samples in logCPM-normalised `ER_data_se1` data set for the up-regulated (left) and down-regulated gene set (right) of the pathway signature using `gsva_scores_dist` function.

Figure 5: Density plots showing the distribution of GSVA scores for samples in logCPM-normalised ER_data_se1 data set for the up-regulated (left) and down-regulated gene set (right) of the pathway signature using gsva_scores_dist function

7.3.1 Determining thresolds for classification

As the two peaks represent represent low expression abundance and high abundance expression, the distribution of GSVA scores that constitute the “valley” i.e. the area between the two peaks (fig. 6), represent samples that exhibit relatively weaker evidence of differential expression abundance relative to other samples and the algorithm would classify these samples as having “Uncertain” pathway activity. Because the peaks represent the mode GSVA scores (one for high and low expression abundance), thresholds should be selected between these two peak to enable the algorithm to classify samples around these modes as being consistent or inconsistent with the pathway signature.

As mentioned above, there are four thresholds that can be tuned by the user: the high and low expression abundance for the up-regulated and down-regulated gene sets. For example, a user could set the high expression abundance thresholds for the up-regulated gene set to -0.2 for the low expression threshold for both gene sets, and 0.2 for the high expression thresholds for both gene sets. Samples with GSVA scores between the these thresholds for both the up-regulated and down -regulated gene-set would be considered by the classification algorithm as having “Uncertain” pathway activity.

plot <- gsva_scores_dist(ER_sig_df, normalized_se)
# Add thresholds on plot
library(ggplot2)
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.2, 0.2))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
GSVA scores distributions of the samples for the down-regulated (left) and up-regulated gene set (right) of the pathway gene signature, showing absolute thresholds of -0.2 and 0.2 for characterising low and high expression abundance of both gene sets respectively. The expression data set used is the logCPM-normalised `ER_data_se1` gene expression data set.

Figure 6: GSVA scores distributions of the samples for the down-regulated (left) and up-regulated gene set (right) of the pathway gene signature, showing absolute thresholds of -0.2 and 0.2 for characterising low and high expression abundance of both gene sets respectively
The expression data set used is the logCPM-normalised ER_data_se1 gene expression data set.

Shrinking the distance between the high and low expression thresholds would result in fewer “Uncertain” pathway activity classifications (fig. 7), because more samples would meet the consistency/ inconsistency expression thresholds for the pathway signature (high expression thresholds would be lower, low expression thresholds would be higher for each gene set).

# more relaxed thresholds, fewer uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), 
                           vline=c(-0.1, 0.1))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
GSVA scores distributions of samples with more relaxed thresholds for assessing expression consistency of the down-regulated gene set (left) and the up-regulated gene set (right) with the pathway gene expression signature. Low and high expression abundance thresholds for the down-regulated gene set (left) are set to -0.1 and 0.1 respectively, while low and high expression abundance thresholds for the up-regulated gene set (right) are also set to -0.1 and 0.1 respectively. Data is generated from running GSVA on logCPM-normalised `ER_data_se1` expression dataset with ER signature (`ER_sig_df`).

Figure 7: GSVA scores distributions of samples with more relaxed thresholds for assessing expression consistency of the down-regulated gene set (left) and the up-regulated gene set (right) with the pathway gene expression signature
Low and high expression abundance thresholds for the down-regulated gene set (left) are set to -0.1 and 0.1 respectively, while low and high expression abundance thresholds for the up-regulated gene set (right) are also set to -0.1 and 0.1 respectively. Data is generated from running GSVA on logCPM-normalised ER_data_se1 expression dataset with ER signature (ER_sig_df).

Conversely, expanding the distance between the thresholds would increase the frequency of “Uncertain” pathway activity classifications, as the number of samples meeting the thresholds for consistency and inconsistency would be reduced.

# more stringent thresholds, greater uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.4, 0.4))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
Density plots for GSVA scores distributions for samples with relatively more stringent thresholds for assessing consistency of the up-regulated gene set (left) and down-rgulated gene set (right) with the pathway signature. The low and high expression abundance thresholds are -0.4 and 0.4 for both up-regulated and down-regulated gene sets respectively. Data is generated from running GSVA on logCPM-normalised `ER_data_se1` expression dataset with ER signature (`ER_sig_df`).

Figure 8: Density plots for GSVA scores distributions for samples with relatively more stringent thresholds for assessing consistency of the up-regulated gene set (left) and down-rgulated gene set (right) with the pathway signature
The low and high expression abundance thresholds are -0.4 and 0.4 for both up-regulated and down-regulated gene sets respectively. Data is generated from running GSVA on logCPM-normalised ER_data_se1 expression dataset with ER signature (ER_sig_df).

7.4 Classification using absolute GSVA score thresholds

As the distribution of GSVA scores tends to be biomodal rather than Guassian, the user may prefer to use absolute GSVA thresholds for checking consistency of expression abundance of each gene set with pathway signature for each sample. PathAnalyser provides the classify_GSVA_abs function for classifying samples by pathway activity using absolute GSVA score thresholds which can be tuned by the user:

  • up_thresh.high: for high expression abundance for the up-regulated gene-set
  • up_thresh.low: for low expression abundance for the up-regulated gene-set of the gene signature
  • dn_thresh.high: for high expression abundance for the down-regulated gene-set of the gene signature
  • dn_thresh.low: for low expression abundance for the down-regulated gene-set of the gene signature

Note that absolute thresholds are required from the user when running classify_GSVA_abs and can be positive or negative numbers. A summary of the classification is printed to the console highlighting the number of samples in each pathway activity class (“Active”, “Inactive” and “Uncertain”) and the number of samples in total.

classes_df <- classify_GSVA_abs(ER_sig_df, normalized_se, 
                                    up_thresh.low=-0.2, up_thresh.high=0.2, 
                                    dn_thresh.low=-0.2, dn_thresh.high=0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         9        10         1 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 19

7.5 Classification using percentile GSVA score thresholds

The package also provides a GSVA-based classification method classify_GSVA_percent, which uses percentile ranks as thresholds rather than absolute GSVA thresholds. The default percentile used by the function is 25%, so the high expression abundance thresholds in both gene sets would be 75%, by default and low expression abundance thresholds in both gene sets would be 25%, default. A custom percentile threshold can be provided using the optional thresh_percent parameter. As for the absolute threshold function, a summary of the classification is printed to the console highlighting the number of samples in each pathway activity class (“Active”, “Inactive” and “Uncertain”) and the number of samples in total. Increasing the percentile threshold for classification has the equivalent effect of reducing the distance between the high expression abundance and low expression abundance thresholds for both gene sets, therefore reducing the frequency of “Uncertain” classifications.

# default percentile = 25% (quartile)
classes_df.percent <- classify_GSVA_percent(ER_sig_df, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         4         5        11 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 9
# custom percentile = 30%
classes_df.percent <- classify_GSVA_percent(ER_sig_df, normalized_se, 
                                            percent_thresh=30)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         6         6         8 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 12

8 Visualising pathway activity

An interactive PCA plot providing a visual summary of the pathway-based classification can be produced by using the classes_pca function with the normalized data set and predicted classes data frame. Each point represent a sample in the expression data set and are colored according to the predicted activity class (“Active”, “Inactive”, “Uncertain”). Hovering over the sample points displays the sample name, along with predicted pathway activity class and also the relative coordinates of the sample in the first two principal components.

8.1 Generating PCA plot

classes_pca(normalized_se, classes_df, pathway_name = "ER")

9 Evaluating pathway activity classification

This package provides a method calculate_accuracy for evaluating pathway activity classification, which gets the actual labels and the labels classified by the classification methods as parameters; then creates confusion matrix based on them. The first parameter true_labels_source could be file-name, matrix or data frame and contains sample names and their corresponding actual labels.The second parameter classes_df could be matrix or data frame and contains classified labels found by the classification methods classify_GSVA_percent or classify_GSVA_abs. Third parameter pathway specifies with which pathway the labels(actual and classified) are associated. Other parameters display_statistics and display_roc_curve are boolean flags used as optional features to display statistics and ROC Curve respectively. Default values for the flags is FALSE.

true_labels_df <- read.table("Sample_labels.txt", sep="\t", header=T)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER", display_statistics=TRUE, 
                                       display_roc_curve=TRUE)
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                9               0                0
#> Prediction Negative                1               9                0
#> Prediction Uncertain               0               1                0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 95.00"
#> [1] "Accuracy amongst classified samples: 94.74"
#> [1] "True Positive(TP): 9"
#> [1] "True Negative(TN): 9"
#> [1] "False Negative(FN): 1"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 90.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 10.00"
#> [1] "--------------------------------------------------------------"
#>  Actual Positive Actual Negative Actual Uncertain
#>  Min.   :0.000   Min.   :0.000   Min.   :0       
#>  1st Qu.:0.500   1st Qu.:0.500   1st Qu.:0       
#>  Median :1.000   Median :1.000   Median :0       
#>  Mean   :3.333   Mean   :3.333   Mean   :0       
#>  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:0       
#>  Max.   :9.000   Max.   :9.000   Max.   :0

10 Example: Assessing ER pathway activity in a breast cancer RNAseq dataset

To demonstrate the standard workflow of assessing pathway activity in a transcriptomic data set using PathAnalyser, we will use PathAnalyser to classify samples in an RNA seq data set according to ER pathway activity as (“Active”, “Inactive” or “Uncertain”) for the ER pathway. The ER pathway status of tumors is of particular interest since, activation of the pathway is correlated sensitivity to endocrine therapy and disease prognosis more broadly. The data set collected from The Genome Atlas Project (TCGA) contains raw read counts for over 20,000 genes (almost the complete human transcriptome) for 20 primary breast cancer samples. Ten samples were shown have the ER pathway active (ER+), while the remaining ten samples were reported to be inactive for the ER pathway (ER-).

10.1 Read RNASeq gene expression dataset and ER pathway signature data

First, the gene expression matrix data set text file which contains raw read counts for the 20 breast cancer samples and ER gene signature text files (up-regulated and down-regulated gene set files) are read into the expression matrix and gene signature data frame data types respectively. The gene expression matrix contains raw read counts for 20,124 genes and 20 samples. Additionally, the ER signature compiled from the Sensitivity to Endocrine Therapy (SET) index proposed by Symanns et al. (2011) contains 160 genes of which 59 are down-regulated and 101 are up-regulated.

# Load transcriptomic data set (gene expression matrix of samples)
data_se <- read_expression_data("toy_data.txt")
dim(data_se)
#> [1] 20124    20
head(data_se)
#>         TCGA.AR.A0TY TCGA.AO.A124 TCGA.BH.A0BG TCGA.AN.A0G0 TCGA.AO.A0J8
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           16841         6974         8211         5586        12210
#> TMEM107          189          851          581          433         1298
#> CAMK2A            10           20           18           27           19
#> TRAM1L1          233          198          426          328         1044
#>         TCGA.D8.A1XZ TCGA.AO.A0JE TCGA.E9.A5FL TCGA.LL.A8F5 TCGA.C8.A12V
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            8627         9138         3219         7663         6867
#> TMEM107          629          253          226         1265          385
#> CAMK2A            30           47           72           45           18
#> TRAM1L1          385          467          153          342           33
#>         TCGA.A2.A0CM TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.A2.A0YF TCGA.E2.A15F
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4416         2869          718        14649         7106
#> TMEM107          332          660          230         1738          319
#> CAMK2A            29           34           25            4           48
#> TRAM1L1          212          482           28          428          507
#>         TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.BH.A18G TCGA.5L.AAT0 TCGA.A8.A07C
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            6351         3150         5278         4839         9226
#> TMEM107          633          234          245          284          565
#> CAMK2A            32           34           23           43            9
#> TRAM1L1          169           71          220          272          615
# read signature data from the two individual gene set files for up-regulated
# and down-regulated gene sets
ER_sig <- read_signature_data("ESR1_UP.v1._UP.grp", "ESR1_DN.v1_DN.grp")
dim(ER_sig)
#> [1] 160   2

10.2 QC and pre-processing of the expression dataset and ER signature data

As the gene expression matrix file contained raw RNASeq counts, the data must be normalized prior to passing the matrix to any of the classification functions. Normalization of read counts is necessary to account for differences in sequencing depths among the libraries. The PathAnalyser log_cpm_transformation function can be used to perform a log counts per million (CPM) transformation on the gene expression matrix. Furthermore, log_cpm graphically confirms the logCPM transformation has been performed correctly since the box plots displayed by the function, representing the logCPM values for each sample, are more aligned following the logCPM transformation compared to before the transformation.

# Transforming matrix with log cpm transformation and sanity check of the transformation
normalized_se <- log_cpm_transformation(data_se)

To ensure that the gene names in the gene signatures and gene expression data set are consistent, we use the check_signature_vs_dataset gene names to filter out genes from the gene expression matrix that are not in signature and those genes which are not expressed in at least 10% of samples. The output of running the function on the normalised dataset and ER signature shows that 147 genes were retained in the normalised dataset with 13 genes being dropped either due to expression in less than 10% of samples, or due to being absent in the ER signature.

normalized_se <- check_signature_vs_dataset(normalized_se, ER_sig)
#> Number of feature present in expression dataset: 146

dim(normalized_se)
#> [1] 146  20

10.3 Classifying breast cancer samples based on ER pathway activity.

Once the expression matrix is normalised to logCPM and genes with insufficient counts or those that do not feature in the ER signature are removed, the expression data set is ready for classification alongside the ER signature. To acquire an idea of the distribution of GSVA scores, the gene expression matrix can be passed as an argument along with the ER signature to the gsva_scores_dist. A bimodal density plot is produced for the up-regulated and down-regulated gene set of the ER signature, as described before reflective of the underlying maximum deviation from zero method used to generate the GSVA score statistic for representing expression abundance of the corresponding gene set of the ER signature. The peak centered around the GSVA score of -0.8 for the down-regulated gene set and the peak around 0.8 for the down-regulated gene set represent the mode of samples with the low expression abundance of the down-regulated genes of the ER signature and high abundance of the up-regulated genes of the ER signature respectively. Similarly, for the up-regulated genes of the ER pathway signature, the peak situated at around a GSVA score of -0.7 and and around the GSVA score of 0.9 represent samples of low expression abundance and high expression abundance of the up-regulated genes of the ER pathway signature respectively.

gsva_scores_dist(ER_sig, normalized_se)

Samples in a gene expression matrix can be classified according to evidence of expression consistency with the up-regulated and down-regulated gene sets of the ER signature using GSVA by using absolute thresholds for high and low expression for each gene set of the ER signature via classify_GSVA_abs or percentile threshold for high and low expression for each gene set of the ER signature via classify_GSVA_percent. Here, we use the default percentile thresholds of 25% to classify samples as having high or low expression abundance with the up-regulated and down-regulated gene sets of the ER gene signature. The R console output of running the classify_GSVA_percent on our data set and ER signature, shows that of the 20 samples in our data set, 4 samples were classified as active and 5 samples were classified as inactive for the ER pathway. Eleven samples were classified as uncertain. A data frame is produced with names of the 20 samples as the first column and their corresponding predicted ER pathway activity classes as the second column.

classes_df <- classify_GSVA_percent(ER_sig, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         4         5        11 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 9
head(classes_df)
#>         sample     class
#> 1 TCGA.AC.A3TN    Active
#> 2 TCGA.AO.A0J8    Active
#> 3 TCGA.A2.A0YF    Active
#> 4 TCGA.5L.AAT0    Active
#> 5 TCGA.E2.A15F Uncertain
#> 6 TCGA.AO.A1KQ Uncertain

Depending on the user’s classification requirements the thresholds can be adjusted. For example, if the user would like to reduce the number of “Uncertain” classification samples, the user could provide 50% percentile threshold as an argument e.g.:

classes_df_50p <- classify_GSVA_percent(ER_sig, normalized_se, 
                                        percent_thresh=50)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>        10        10         0 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 20

From the output of running the classification function with 50% percentile, it is evident that no samples were classified as “Uncertain”, with 10 samples classified as “Active” and 10 samples classified as “Inactive” by the classification algorithm.

Alternatively, the breast cancer samples in the data set could be classified by ER pathway activity, using absolute GSVA score thresholds for the up-regulated and down-regulated gene sets of the ER signature respectively.

classes_df_abs <- classify_GSVA_abs(ER_sig, normalized_se, up_thresh.low = -0.2, 
                                    up_thresh.high = 0.2, dn_thresh.low = -0.2, 
                                    dn_thresh.high = 0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         9        10         1 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 19

10.4 Visualizing ER pathway activity classification of the breast cancer samples

To acquire a visual summary of the ER pathway-based classification of our 20 breast cancer samples, we can run the classes_pca function to produce an interactive PCA plot of the samples coloured by the predicted ER pathway activity status (active, inactive, uncertain).

classes_pca(normalized_se, classes_df, pathway_name = "ER")

The PCA plot shows the samples predicted as active and inactive form relatively tight clusters in the first principal component (PC), as would be expected since the inter-variation within the active and inactive classes would be smaller than between the different classes. Interestingly, the active and inactive clusters are in opposite quadrants in the first PC indicating the samples have been separated according to pathway activity well in the first PC. Furthermore, samples classified as uncertain are more scattered in the first PC, with a few samples present in the active / inactive clusters. Overall the PCA plot shows the samples have clustered well according to the predicted pathway activity classes. Sample names and their corresponding predicted pathway activity class can be displayed for a given sample on the plot by hovering over a sample point using a mouse pointer.

10.5 Assessing the accuracy of the pathway-based classification

A more detailed assessment of the ER pathway based classification of the breast cancer samples can be performed by providing the true pathway activity class labels for each sample, along with predicted pathway activity classes classes_df to the calculate_accuracy function. These “true labels” must only contain (“Active”, “Inactive or”Uncertain”) for the calculate_accuracy function to correctly assess the pathway-based classification. The pathway parameter is set to “ER” since ER pathway activity is being assessed. The function outputs a confusion matrix reporting in tabular form, the number of true positives (TP), true negative(TN), false positives (FP), false negatives (FN) and also a breakdown of the number of uncertain classifications and their actual pathway activity status. Further classification evaluation using the display_statistics reveals an accuracy of 100% of among the 9 classified samples, the 4 samples classified as ER positive were actually ER positive and similarly, the 5 samples classified as ER negative were indeed ER negative.

# read true pathway activity class labels for data set samples
true_labels_df <- read.table("Sample_labels.txt", 
                             header=TRUE, sep = "\t")
# assess accuracy and generate confusion matrix for classification
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER")
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                4               0                0
#> Prediction Negative                0               5                0
#> Prediction Uncertain               6               5                0
#> [1] "--------------------------------------------------------------"
# detail breakdown of classification evaluation (accuracy, % classified etc)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER", display_statistics = T)
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                4               0                0
#> Prediction Negative                0               5                0
#> Prediction Uncertain               6               5                0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 45.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 4"
#> [1] "True Negative(TN): 5"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#>  Actual Positive Actual Negative Actual Uncertain
#>  Min.   :0.000   Min.   :0.000   Min.   :0       
#>  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:0       
#>  Median :4.000   Median :5.000   Median :0       
#>  Mean   :3.333   Mean   :3.333   Mean   :0       
#>  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:0       
#>  Max.   :6.000   Max.   :5.000   Max.   :0

In addition to a detail summary of the classification assessment, we can also view a receiver operating curve (ROC) curve for our classification, which visualizes the sensitivity and specificity of the data set classification by fitting a logistic regression model to the binary ER pathway activity classifications (active, inactive):

# detail breakdown of classification evaluation (accuracy, % classified etc)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER", display_statistics = T,
                                       display_roc_curve = T)
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                4               0                0
#> Prediction Negative                0               5                0
#> Prediction Uncertain               6               5                0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 45.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 4"
#> [1] "True Negative(TN): 5"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#>  Actual Positive Actual Negative Actual Uncertain
#>  Min.   :0.000   Min.   :0.000   Min.   :0       
#>  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:0       
#>  Median :4.000   Median :5.000   Median :0       
#>  Mean   :3.333   Mean   :3.333   Mean   :0       
#>  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:0       
#>  Max.   :6.000   Max.   :5.000   Max.   :0

In concordance with the statistical classification evaluation metrics, the ROC curve also shows that the samples were classified well according to ER pathway activity by the curve reaching the top left corner of the plot (indicating a high true positive rate and a low false positive rate). The nine points represent the nine samples classified as “active” / “inactive” for the ER pathway.

11 Session Info

The output of sessionInfo upon which this file was generated:

sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.3.5           PathAnalyser_0.0.0.9000 GSVA_1.42.0            
#> [4] BiocStyle_2.22.0       
#> 
#> loaded via a namespace (and not attached):
#>   [1] plyr_1.8.7                  lazyeval_0.2.2             
#>   [3] GSEABase_1.56.0             crosstalk_1.2.0            
#>   [5] BiocParallel_1.28.3         usethis_2.1.5              
#>   [7] GenomeInfoDb_1.30.1         digest_0.6.29              
#>   [9] htmltools_0.5.2             fansi_1.0.3                
#>  [11] magrittr_2.0.3              memoise_2.0.1              
#>  [13] ScaledMatrix_1.2.0          NCmisc_1.1.6               
#>  [15] limma_3.50.1                remotes_2.4.2              
#>  [17] Biostrings_2.62.0           annotate_1.72.0            
#>  [19] matrixStats_0.61.0          prettyunits_1.1.1          
#>  [21] colorspace_2.0-3            blob_1.2.2                 
#>  [23] xfun_0.30                   dplyr_1.0.8                
#>  [25] callr_3.7.0                 crayon_1.5.1               
#>  [27] RCurl_1.98-1.6              jsonlite_1.8.0             
#>  [29] graph_1.72.0                glue_1.6.2                 
#>  [31] gtable_0.3.0                zlibbioc_1.40.0            
#>  [33] XVector_0.34.0              DelayedArray_0.20.0        
#>  [35] pkgbuild_1.3.1              BiocSingular_1.10.0        
#>  [37] Rhdf5lib_1.16.0             SingleCellExperiment_1.16.0
#>  [39] BiocGenerics_0.40.0         HDF5Array_1.22.1           
#>  [41] scales_1.1.1                futile.options_1.0.1       
#>  [43] DBI_1.1.2                   edgeR_3.36.0               
#>  [45] Rcpp_1.0.8.3                viridisLite_0.4.0          
#>  [47] xtable_1.8-4                bit_4.0.4                  
#>  [49] rsvd_1.0.5                  stats4_4.1.3               
#>  [51] htmlwidgets_1.5.4           httr_1.4.2                 
#>  [53] RColorBrewer_1.1-3          ellipsis_0.3.2             
#>  [55] pkgconfig_2.0.3             XML_3.99-0.9               
#>  [57] farver_2.1.0                sass_0.4.1                 
#>  [59] locfit_1.5-9.5              utf8_1.2.2                 
#>  [61] tidyselect_1.1.2            labeling_0.4.2             
#>  [63] rlang_1.0.2                 reshape2_1.4.4             
#>  [65] AnnotationDbi_1.56.2        munsell_0.5.0              
#>  [67] tools_4.1.3                 cachem_1.0.6               
#>  [69] cli_3.2.0                   generics_0.1.2             
#>  [71] RSQLite_2.2.12              devtools_2.4.3             
#>  [73] evaluate_0.15               stringr_1.4.0              
#>  [75] fastmap_1.1.0               yaml_2.3.5                 
#>  [77] processx_3.5.3              knitr_1.38                 
#>  [79] bit64_4.0.5                 fs_1.5.2                   
#>  [81] purrr_0.3.4                 KEGGREST_1.34.0            
#>  [83] sparseMatrixStats_1.6.0     formatR_1.12               
#>  [85] proftools_0.99-3            brio_1.1.3                 
#>  [87] compiler_4.1.3              rstudioapi_0.13            
#>  [89] plotly_4.10.0               png_0.1-7                  
#>  [91] testthat_3.1.3              tibble_3.1.6               
#>  [93] bslib_0.3.1                 stringi_1.7.6              
#>  [95] highr_0.9                   ps_1.6.0                   
#>  [97] futile.logger_1.4.3         desc_1.4.1                 
#>  [99] lattice_0.20-45             Matrix_1.4-1               
#> [101] vctrs_0.4.0                 pillar_1.7.0               
#> [103] lifecycle_1.0.1             rhdf5filters_1.6.0         
#> [105] BiocManager_1.30.16         jquerylib_0.1.4            
#> [107] data.table_1.14.2           bitops_1.0-7               
#> [109] irlba_2.3.5                 GenomicRanges_1.46.1       
#> [111] R6_2.5.1                    bookdown_0.25              
#> [113] IRanges_2.28.0              sessioninfo_1.2.2          
#> [115] reader_1.0.6                lambda.r_1.2.4             
#> [117] assertthat_0.2.1            pkgload_1.2.4              
#> [119] rhdf5_2.38.1                SummarizedExperiment_1.24.0
#> [121] rprojroot_2.0.3             withr_2.5.0                
#> [123] S4Vectors_0.32.4            GenomeInfoDbData_1.2.7     
#> [125] parallel_4.1.3              VennDiagram_1.7.1          
#> [127] grid_4.1.3                  beachmat_2.10.0            
#> [129] tidyr_1.2.0                 rmarkdown_2.13             
#> [131] DelayedMatrixStats_1.16.0   MatrixGenerics_1.6.0       
#> [133] pROC_1.18.0                 Biobase_2.54.0

12 References

Hänzelmann, S., Castelo, R. & Guinney, J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14, 7 (2013). https://doi.org/10.1186/1471-2105-14-7

Symmans, W.F., Hatzis, C., Sotiriou, C., Andre, F., Peintinger, F., Regitnig, P., Daxenbichler, G., Desmedt, C., Domont, J., Marth, C. and Delaloge, S., 2010. Genomic index of sensitivity to endocrine therapy for breast cancer. Journal of clinical oncology, 28(27), p.4111. https://dx.doi.org/10.1200%2FJCO.2010.28.4273